Introduction to machine learning

GEOG 30323

April 21, 2020

Data Science

  • Data science: new(ish) field that has emerged to address the challenges of working with modern data
  • Fuses statistics, computer science, visualization, graphic design, and the humanities/social sciences/natural sciences…

The data analysis process

Visualization vs. modeling

Hadley Wickham (paraphrased):

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.

Statistical modeling

  • What is the mathematical relationship between an outcome variable \(Y\) and one or more other “predictor” variables \(X_{1}, \ldots, X_{n}\)?
  • Recall our use of lmplot() in seaborn: “lm” stands for linear model

Statistical modeling

The linear model:

\[ Y = Xb + e \]

where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”

  • Linear models will not always be appropriate for modeling relationships between variables!
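
As a minimal sketch of the linear model above, the parameters \(b\) can be estimated by least squares with NumPy. The data here are simulated, and the names b_true and b_hat are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): 200 observations, an intercept column plus two predictors
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
b_true = np.array([1.0, 2.0, -0.5])   # "true" parameters chosen for the simulation
e = rng.normal(scale=0.1, size=n)     # errors
Y = X @ b_true + e                    # Y = Xb + e

# Least-squares estimate of b, and the residuals it leaves behind
b_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ b_hat

print(b_hat)
```

Because the simulated errors are small, b_hat should land close to b_true; with real data the true parameters are unknown and the residuals are what you inspect.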

Statistics in Python

  • Substantial statistical functionality is available in the statsmodels package, which is installed in CoCalc

  • Example: statistical modeling for inference

Inferential statistics in Python

Let’s get an example ready:

Linear regression
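
A simple linear regression might be fit with the statsmodels formula interface roughly as follows. This is a sketch on simulated data, not the dataset used in class; the column names x and y are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated data (assumption): y depends linearly on x, plus noise
df = pd.DataFrame({"x": rng.uniform(0, 10, size=150)})
df["y"] = 3.0 + 1.5 * df["x"] + rng.normal(scale=1.0, size=150)

# Ordinary least squares: outcome on the left of ~, predictor on the right
model = smf.ols("y ~ x", data=df).fit()

print(model.params)     # estimated intercept and slope
print(model.rsquared)   # share of variance in y explained by x
```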

Multiple regression
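
Multiple regression adds more predictors to the right-hand side of the formula. The sketch below uses simulated data with hypothetical column names x1 and x2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 200

# Simulated data (assumption): two predictors with known effects on y
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 + 1.0 * df["x1"] - 3.0 * df["x2"] + rng.normal(scale=0.5, size=n)

# Each predictor is joined with + in the formula
model = smf.ols("y ~ x1 + x2", data=df).fit()

# The summary table reports coefficients, standard errors, and p-values
print(model.summary())
```

Each coefficient is read as the expected change in y for a one-unit change in that predictor, holding the other predictor constant.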

Machine learning

  • “The science of getting computers to act without being explicitly programmed”
  • Types of machine learning algorithms: supervised and unsupervised
  • Topics in machine learning: classification, clustering, regression

Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Supervised learning

  • Supervised machine learning: a model is “trained” on examples with known outcomes and optimized for predictive power

  • Regression problem: predicting a numeric outcome

  • Classification problem: predicting a categorical outcome

Regression for prediction

  • Task: predict the median earnings of graduates 10 years after graduation based on a series of college characteristics

  • Method: train a model on a subset of the data, then test the model on the remaining subset

  • Example method: random forest regression

Preparing the model

  • The train_test_split() function from scikit-learn splits your data randomly into training (75 percent, by default) and test (25 percent) datasets
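
A small sketch of the default split, using made-up arrays in place of the college data (the array contents are an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 100 rows, 2 predictor columns, 1 outcome
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# With no test_size given, 25 percent of the rows are held out for testing;
# random_state makes the random split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)   # (75, 2) (25, 2)
```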

Preparing the model
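
Fitting a random forest regressor on the training data might look like the sketch below. The data are generated with make_regression() as a stand-in for the college characteristics, and the choice of 100 trees is an assumption rather than a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 5 numeric predictors
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the model on the training subset only
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# R-squared on the training data (usually optimistic)
print(rf.score(X_train, y_train))
```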

Model diagnostics

Model diagnostics
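
One common way to check a regression model is to compare its predictions with the held-out test outcomes. The sketch below computes R-squared and root mean squared error on synthetic data; the dataset and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Evaluate on data the model has not seen
y_pred = rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(r2, rmse)
```

RMSE is expressed in the units of the outcome, so for an earnings model it would read as a typical prediction error in dollars.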

Random Forest Classifiers

  • Task: predict whether a non-profit college is public or private
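
A classifier is trained the same way as a regressor, but the outcome is a category. In this sketch the labels 0 and 1 stand in for "public" and "private"; the data come from make_classification() and are not the college dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption): 0 = "public", 1 = "private"
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.classes_)   # the category labels the model learned
```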

Model diagnostics
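
For a classifier, accuracy and a confusion matrix are typical first diagnostics. The sketch below assumes synthetic two-class data rather than the college dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Share of test observations classified correctly
acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(acc)
print(cm)
```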

Feature importance
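
A fitted random forest exposes a feature_importances_ attribute that ranks predictors by how much they contribute to the model. The feature names below are hypothetical placeholders, and the data are synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Importances are non-negative and sum to 1 across all features
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```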

Making predictions
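
Once trained, the model can predict classes and class probabilities for new observations. In this sketch the "new" rows are simply the first three test rows of a synthetic dataset, which is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Treat three held-out rows as new observations
new_rows = X_test[:3]
preds = clf.predict(new_rows)         # predicted class for each row
probs = clf.predict_proba(new_rows)   # one probability column per class

print(preds)
print(probs)
```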


Unsupervised learning in Python

  • Imports and setup:

Example: K-means clustering

Example: K-means clustering
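
K-means groups observations into a chosen number of clusters without using any outcome labels. The sketch below runs it on synthetic points from make_blobs(); the choice of three clusters is an assumption that matches how the data were generated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around three centers (assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ask for three clusters; n_init reruns the algorithm from different starts
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # one (x, y) center per cluster
print(labels[:10])           # cluster assignment for the first ten points
```

With real data the number of clusters is rarely known in advance, so it is usually worth trying several values of n_clusters.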

Example: nearest-neighbor search
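
A nearest-neighbor search finds the observations closest to a query point. This sketch uses scikit-learn's NearestNeighbors on five made-up coordinate pairs; the points and the query location are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up point locations (assumption)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

# Index the points, then ask for the two nearest to a query location
nn = NearestNeighbors(n_neighbors=2)
nn.fit(points)

query = np.array([[0.2, 0.1]])
distances, indices = nn.kneighbors(query)

print(indices)     # row positions of the two nearest points: [[0 1]]
print(distances)   # Euclidean distances to those points
```

The default metric is Euclidean distance, which suits projected coordinates; longitude and latitude in degrees would need a different metric or a projection first.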

How to learn more

  • Take statistics and machine learning courses here at TCU!
  • Check out DataCamp for hundreds of courses on data science in Python and R